Main packages used: haven, readxl, readr
Main functions covered: readr::read_csv, readxl::read_excel, haven::read_stata, haven::read_spss, haven::read_sas, write.table()

Supplementary resources:

1 Importing data into R

1.1 Packages for importing data

We saw how we can create data within R, but most of the time you need to load an existing dataset into R to start the analysis.

At this point we want more than what base R can offer to us. Let’s install and load some packages! Packages are the cornerstone of the R ecosystem: there are thousands of super useful packages (the most common repository for them is CRAN). Whenever you face a specific problem (that can be highly domain specific) there is a good chance that there is at least one package that offers a solution.

An R package is a collection of functions that work much the same way as the ones we saw earlier. These functions and packages are written by R users and shared with the community. The focus and range of these packages are wide: from data cleaning through data visualization to ecological and environmental data analysis, there is a package for everyone and everything. This ample supply might be intimidating at first, but it also means that there is probably a solution out there to a given problem.

To install a package from the CRAN repository we will use the install.packages() function. Note that it requires the package’s name as a character string (in quotes).

# data import / export
install.packages("readr")
install.packages("haven")
install.packages("readxl")

# exploring data
install.packages("gapminder")
install.packages("dplyr")
install.packages("ggplot2")

After you have installed a given package, you need to load it to be able to use its functions. We do this with the library() command. It is good practice to load all the packages at the beginning of your script.

# data import / export
library(readr) # importing various data formats into R
library(haven) # importing data formats from other programs (Stata, SAS, SPSS)
#> Warning: package 'haven' was built under R version 3.4.4
library(readxl) # importing Excel tables into R
#> Warning: package 'readxl' was built under R version 3.4.4

# exploring data
library(gapminder) # data we will use
library(dplyr) # data manipulation package
#> Warning: package 'dplyr' was built under R version 3.4.4
library(ggplot2) # data visualization package
#> Warning: package 'ggplot2' was built under R version 3.4.4

Important note: whenever there is a conflicting function name (e.g. two packages have a function with the same name) you can specify which function you want to use with the package::function syntax. Below, when loading in the data, I use readr::read_csv to signify which package the function comes from.
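
A minimal sketch of such a conflict, assuming dplyr is installed: both the stats package (loaded by default) and dplyr have a function called filter(), and they do very different things. The toy data frame toy_df is made up for this illustration.

```r
library(dplyr)

# toy data, made up for illustration
toy_df <- data.frame(x = 1:5)

# dplyr's filter() keeps the rows that satisfy a condition
rows_kept <- dplyr::filter(toy_df, x > 2)

# stats' filter() is a different tool: it applies a linear filter
# (here a 3-point moving average) to a numeric series
moving_avg <- stats::filter(toy_df$x, rep(1 / 3, 3))

rows_kept
moving_avg
```

Writing the package name in front removes any doubt about which of the two functions is called, regardless of the order in which the packages were loaded.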

1.1.1 Text (.csv)

We will look at the Quality of Government basic data set and import it with different file extensions. First let’s load the .csv file (the extension stands for comma separated values). You can either load it from your project folder or directly from the GitHub repo. We are using the readr package, which can read comma separated values with the read_csv function. It is a specific case of the read_delim function, where you can specify the character that is used as a delimiter (e.g. in much of Europe the comma is used as the decimal mark, so the delimiter is often a semicolon).

(the codebook for the dataset is here, if you are interested: https://www.qogdata.pol.gu.se/data/qog_bas_jan18.pdf)

Loading it from the GitHub repository requires only the url of the data file and then the read_csv function will take care of the rest.

qog_text <- readr::read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/qog_bas_cs_jan18.csv")

If you have the data downloaded, you can load it from your project folder. In the code below, I put the data file into the data folder within my project folder (the full path looks like this: mydrive:/folder/project_folder/data). The "data/file.csv" part is called the relative path: when using a project, we do not need to type out the whole path to the file, just its location relative to our main project folder.


qog_text <- read_delim("data/qog_bas_cs_jan18.csv", delim = ",")

With readr::read_csv I specified that I use the function from that specific package. The package::function syntax is useful if there are conflicting functions in the loaded packages, or if you want to make the package explicit when functions have very similar names. In this case, base R also has a read.csv function, which is a bit slower than the one in readr.
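
To illustrate the delimiter point from the start of this subsection, here is a small sketch that reads a semicolon-delimited file with decimal commas. The file content, the object name scores and the column names are made up; the file is written to a temporary location so that the code runs without the course data.

```r
library(readr)

# write a tiny semicolon-delimited file with decimal commas (made-up values)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("country;score", "Hungary;1,5", "Sweden;2,25"), tmp_file)

# delim sets the column separator, locale() sets the decimal mark
scores <- read_delim(
  tmp_file,
  delim = ";",
  locale = locale(decimal_mark = ","),
  col_types = cols(country = col_character(), score = col_double())
)

scores
```

readr also has a read_csv2() shortcut for exactly this semicolon and decimal comma combination.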

1.1.2 Excel (.xls and .xlsx)

Next we load the Excel file. We use the readxl package’s read_excel function (unfortunately, it does not support opening files via URLs).

qog_excel <- read_excel("data/qog_bas_cs_jan18.xlsx")

1.1.3 Stata (.dta), SPSS (.sav), and SAS (.sas7bdat)

Importing data files from Stata 13+, SPSS, and SAS is similarly easy, using the haven package. If you have a data file that is not in these formats (or collaborators who work with weird software choices) you can check the foreign and rio packages. You also have some capability in base R but it is quite picky about software versions (check the read.* functions).

# read the Stata .dta file
qog_stata <- read_stata("data/qog_bas_cs_jan18.dta")

# read the SPSS .sav file
qog_spss <- read_spss("data/qog_bas_cs_jan18.sav")

# read the SAS .SAS7BDAT file
beer_sas <- read_sas("data/beer.sas7bdat")

To remove the unnecessary objects, we can use the rm() function.

rm(beer_sas, qog_excel, qog_stata, qog_spss)

To remove everything from your environment, you can use rm(list=ls()). Note: this will only delete the objects in your environment; the packages stay loaded!

1.1.4 Importing data from APIs (OPTIONAL)

There are loads of ways to load data to R via APIs, from various sources. This subsection builds on the excellent blogpost of Data Acquisition in R. We are not going into details (if you are interested, give the linked post a read!).

  • The eurostat package lets you import data from… Eurostat!

  • The wbstats package is an API client for World Bank data

  • The OECD package provides access to the API of the OECD database

  • The WID package is for the World Wealth and Income Database. It is a bit trickier to install than with our trusted install.packages(). Since it is not on the CRAN repository, we need to download the package directly from its developers’ GitHub page, which contains its source code. The devtools package allows just that.

install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
library(wid)

QUICK EXERCISE: read the help of the read_excel function and import the data into a new object where only the first 5 columns are imported.

Solution:

#> # A tibble: 194 x 5
#>    ccode cname               ccodealp ccodecow ccodewb
#>    <dbl> <chr>               <chr>       <dbl>   <dbl>
#>  1  4.00 Afghanistan         AFG         700      4.00
#>  2  8.00 Albania             ALB         339      8.00
#>  3 12.0  Algeria             DZA         615     12.0 
#>  4 20.0  Andorra             AND         232     20.0 
#>  5 24.0  Angola              AGO         540     24.0 
#>  6 28.0  Antigua and Barbuda ATG          58.0   28.0 
#>  7 31.0  Azerbaijan          AZE         373     31.0 
#>  8 32.0  Argentina           ARG         160     32.0 
#>  9 36.0  Australia           AUS         900     36.0 
#> 10 40.0  Austria             AUT         305     40.0 
#> # ... with 184 more rows
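
One possible way to get there, as a sketch rather than the only solution: read_excel() has a range argument, and the cell_cols() helper restricts the import to the given columns. So that the code runs without the course file, it uses the example workbook that ships with readxl (the object names example_path and first_five are made up); for the exercise itself, the path would be "data/qog_bas_cs_jan18.xlsx" and the sheet argument can be dropped.

```r
library(readxl)

# path to an example workbook bundled with readxl
example_path <- readxl_example("datasets.xlsx")

# import only the first five columns of the "mtcars" sheet
first_five <- read_excel(example_path,
                         sheet = "mtcars",
                         range = cell_cols(1:5))

first_five
```

An alternative is to import everything and then keep the first five columns by indexing, but limiting the range at import time avoids reading columns you do not need.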

1.1.5 Exporting data from R

Let’s export our Quality of government data out of R. We can use write.table() to create a .csv file in our project directory.

write.table(qog_text, "qog_export.csv", sep = ",", row.names = FALSE) # row.names = FALSE keeps R's row numbers out of the file
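
One detail worth knowing: by default write.table() also writes the row names, and the header line then has one entry fewer than the data lines. The sketch below shows a round trip with row.names = FALSE, using a made-up toy data frame and a temporary file instead of the course data.

```r
# toy data frame, made up for illustration
toy_df <- data.frame(country = c("Hungary", "Sweden"),
                     score = c(1.5, 2.25))

tmp_csv <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file
write.table(toy_df, tmp_csv, sep = ",", row.names = FALSE)

# read the file back and check that the columns survived the trip
toy_back <- read.csv(tmp_csv)
toy_back
```

Base R’s write.csv() is a wrapper around write.table() with the comma separator already set.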

Or we can use R’s own data format, .Rda, which takes up less hard disk space. However, if you intend to share your exported data, I’d recommend the .csv output of write.table.

The peculiarity of save() is that it can save any R object, not just a data frame, so if you want to reuse a given object later and do not want to spend computational time recreating it every time, you can save it as well. It will also save the file into your working directory.

save(qog_text, file = "qog_rda.Rda")
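
To get the saved objects back in a later session, the counterpart of save() is load(). A minimal sketch with made-up toy objects and a temporary file:

```r
# toy objects, made up for illustration
toy_df <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
toy_note <- "created for the save/load example"

tmp_rda <- tempfile(fileext = ".Rda")

# save() can store several objects in one file
save(toy_df, toy_note, file = tmp_rda)

# remove them, then restore them under their original names
rm(toy_df, toy_note)
restored <- load(tmp_rda)  # load() returns the names of the restored objects

restored
toy_df
```

Note that load() puts the objects back under the names they were saved with, so it can silently overwrite an object of the same name in your environment.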

2 Summarising and exploring data

Main packages used: base, ggplot2, dplyr
Main functions covered: head(), tail(), dplyr::glimpse(), str(), summary(), table(), dplyr::group_by() and summarise(), ggplot2::ggplot()
Supplementary resources: ggplot2 cheat sheet

2.1 Overview and summary statistics

We will use the gapminder package that contains macro data from the famous Hans Rosling presentation on life expectancy and regional convergence.

gapminder_df <- gapminder

These data sets are small enough that you can check them in RStudio’s data viewer (or with the View() function). However, if you have a bigger dataset or less memory, this can be problematic. The most basic and quickest way to check if your data loaded properly is the head() function. It shows you the first six rows by default. tail() shows you the end of the data set.

head(gapminder_df)
#> # A tibble: 6 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333       779
#> 2 Afghanistan Asia       1957    30.3  9240934       821
#> 3 Afghanistan Asia       1962    32.0 10267083       853
#> 4 Afghanistan Asia       1967    34.0 11537966       836
#> 5 Afghanistan Asia       1972    36.1 13079460       740
#> 6 Afghanistan Asia       1977    38.4 14880372       786
# let's check the end of our data
tail(gapminder_df)
#> # A tibble: 6 x 6
#>   country  continent  year lifeExp      pop gdpPercap
#>   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Zimbabwe Africa     1982    60.4  7636524       789
#> 2 Zimbabwe Africa     1987    62.4  9216418       706
#> 3 Zimbabwe Africa     1992    60.4 10704340       693
#> 4 Zimbabwe Africa     1997    46.8 11404948       792
#> 5 Zimbabwe Africa     2002    40.0 11926563       672
#> 6 Zimbabwe Africa     2007    43.5 12311143       470

If you are interested in only a certain number of observations and columns, you can specify that as well with the head() function or by indexing your data frame.

head(gapminder_df, 3)
#> # A tibble: 3 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333       779
#> 2 Afghanistan Asia       1957    30.3  9240934       821
#> 3 Afghanistan Asia       1962    32.0 10267083       853

Or check rows 1 to 3 of the first two variables by indexing.

head(gapminder_df[1:3, 1:2])
#> # A tibble: 3 x 2
#>   country     continent
#>   <fct>       <fct>    
#> 1 Afghanistan Asia     
#> 2 Afghanistan Asia     
#> 3 Afghanistan Asia

If you want to have a quick overview of your data, both str() and dplyr::glimpse are helpful.

str(gapminder_df)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
#>  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#>  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
#>  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
#>  $ gdpPercap: num  779 821 853 836 740 ...

Both str() and glimpse() print out parts of your data set. The main difference between them and head() is that they give you a complete look at your list of variables and their types.

dplyr::glimpse(gapminder_df)
#> Observations: 1,704
#> Variables: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

You can use str() on any object to see its structure (we’ll come back to this later).

Now that we know we imported our data properly and have some sense of its dimensions (check what the dim() function does!) let’s have a more in depth look!
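
If you would like to compare your answer for dim(), here is a short sketch together with a few related helpers. It assumes the gapminder package is installed and recreates gapminder_df so that the snippet runs on its own.

```r
library(gapminder)

gapminder_df <- gapminder

dim(gapminder_df)    # number of rows, then number of columns
nrow(gapminder_df)   # rows only
ncol(gapminder_df)   # columns only
names(gapminder_df)  # the variable names
```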

For starters, use summary() to check some basic descriptives of our variables.

summary(gapminder_df)
#>         country        continent        year         lifeExp     
#>  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
#>  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
#>  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
#>  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
#>  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
#>  Australia  :  12                  Max.   :2007   Max.   :82.60  
#>  (Other)    :1632                                                
#>       pop              gdpPercap       
#>  Min.   :6.001e+04   Min.   :   241.2  
#>  1st Qu.:2.794e+06   1st Qu.:  1202.1  
#>  Median :7.024e+06   Median :  3531.8  
#>  Mean   :2.960e+07   Mean   :  7215.3  
#>  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
#>  Max.   :1.319e+09   Max.   :113523.1  
#> 

For the categorical variables you can use table() which generates a frequency table.

frq_table <- table(gapminder_df$continent)

frq_table
#> 
#>   Africa Americas     Asia   Europe  Oceania 
#>      624      300      396      360       24

2.2 Using the “tidyverse”

Now we dip into the tidyverse with the dplyr package.

Disclaimer: I will emphasise the tidyverse throughout the course, since it is one of the most useful meta-packages: a collection of packages that helps to develop a consistent workflow from data import to data cleaning to analysis. We will spend more time on this during the coming sessions. The tidyverse packages are a great stepping stone into R, and later on, for more complex problems, you can turn to tools that are outside of these packages.

https://stackoverflow.blog/2017/10/10/impressive-growth-r/

2.2.1 The pipe operator

One key advantage of the tidyverse packages is their compatibility with the pipe operator: %>% (RStudio shortcut: Ctrl + Shift + M). It combines code and makes it more intuitive to work with data. Think of the %>% as a “then”. This is just a short intro, so don’t worry if it is confusing at first. We will be using dplyr and other tidyverse packages a lot, so you’ll have enough time to get comfortable. For every “tidy” function and solution there is a base R or other package alternative, so if you find something unintuitive or problematic, you can find another solution by searching for it.

The %>% essentially passes the data on its left as the first argument to the function on its right.
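
A small sketch of what this means in practice, with a made-up numeric vector: the nested call and the piped call below compute the same thing, only the reading order differs.

```r
library(dplyr)  # attaching dplyr also makes %>% available

scores <- c(4, 9, 16)  # made-up numbers

# nested call: read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# piped call: take scores, then sqrt, then mean, then round
piped <- scores %>% sqrt() %>% mean() %>% round(1)

nested
piped
```

In the last step, round(1) receives the piped value as its first argument, so it is equivalent to round(value, 1).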

In our first step, we group our data by continent and then create a summary variable, that tells us the mean GDP for each continent. Check the code snippet below to see the role of each line.

# let's see the GDP by continents
gdp_cont <- gapminder_df %>% # we will use the gapminder data
    dplyr::group_by(continent) %>% # then group the data by continent
    dplyr::summarise(mean_gdp = mean(gdpPercap)) # then create a `mean_gdp` variable where we compute the mean of the grouped gdpPercap variable.

# let's see the result.
gdp_cont
#> # A tibble: 5 x 2
#>   continent mean_gdp
#>   <fct>        <dbl>
#> 1 Africa        2194
#> 2 Americas      7136
#> 3 Asia          7902
#> 4 Europe       14469
#> 5 Oceania      18622

You can also create your own summary.


# the number of observations and mean for the GDP per capita variable
gapminder_df %>% 
    summarise(n = n(), gdp_mean = mean(gdpPercap))
#> # A tibble: 1 x 2
#>       n gdp_mean
#>   <int>    <dbl>
#> 1  1704     7215

QUICK EXERCISE: combine the two approaches and check the number of observations and the average, maximum, and minimum life expectancy by continent (hint: use the n(), mean(), max() and min() functions). Your result should look like the one below.

solution:

#> # A tibble: 5 x 5
#>   continent     n  mean maximum minimum
#>   <fct>     <int> <dbl>   <dbl>   <dbl>
#> 1 Africa      624  48.9    76.4    23.6
#> 2 Americas    300  64.7    80.7    37.6
#> 3 Asia        396  60.1    82.6    28.8
#> 4 Europe      360  71.9    81.8    43.6
#> 5 Oceania      24  74.3    81.2    69.1
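
One way to arrive at a table like the one above, as a sketch: group by continent first, then ask summarise() for all four statistics at once. The object name life_by_continent is made up, and the snippet uses the gapminder data directly so that it runs on its own.

```r
library(gapminder)
library(dplyr)

life_by_continent <- gapminder %>%
  group_by(continent) %>%               # one group per continent
  summarise(n = n(),                    # number of country-years
            mean = mean(lifeExp),
            maximum = max(lifeExp),
            minimum = min(lifeExp))

life_by_continent
```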

2.3 Exploration by visualization

Using data visualization is a great way to get acquainted with your data, and sometimes it makes more sense than looking at large tables. In this section we get into the ggplot2 package, which we’ll use throughout the class. It is at the cutting edge of R’s data visualization toolset (not just in academia, but in business and data journalism as well).

2.3.1 Introduction to ggplot2

We will spend most of our time using ggplot2 for visualizing in the class and I would personally encourage the course participants to stick to ggplot2. If for some reason you would like a non ggplot way of plotting in R, there is a section on base R plotting at the end of this notebook.

The name stands for grammar of graphics. It enables you to build your plot layer by layer and gives you the ability to control every detail of the output (if you so wish). It is used by many in academia, by Uber, StackOverflow, AirBnB, the Financial Times, BBC and FiveThirtyEight writers, among many others.

You create plots with the below syntax:

Source: Healy, Kieran. Data Visualization: A Practical Introduction. Princeton University Press, 2018. (Ch. 3)
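
In code, the general pattern can be sketched as follows. This is a minimal illustration of the usual data, mapping and geom structure, not a reproduction of the cited figure; it uses the built-in mtcars data so that it runs with only ggplot2 installed, and the object name p_template is made up.

```r
library(ggplot2)

# general pattern:
# ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

p_template <- ggplot(data = mtcars,                    # 1. the data frame
                     mapping = aes(x = wt, y = mpg)) + # 2. which variable goes where
  geom_point()                                         # 3. how to draw the data

p_template
```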

To get some idea about our variables, let’s plot them on a histogram. First, we examine the GDP per capita variable from our gapminder dataset. For this, we use the geom_histogram() function of ggplot2. It gives a bare-bones histogram (the frequency distribution) of our chosen continuous variable.

Let’s create the foundation of our plot by specifying for ggplot the data we use and the variable we want to plot.

ggplot(data = gapminder_df,
       mapping = aes(x = gdpPercap))

We need to specify what sort of shape we want our data to be displayed as. We can do this by adding the geom_histogram() function with a +.

ggplot(data = gapminder_df,
       mapping = aes(x = gdpPercap)) +
  geom_histogram()

Looks a little bit skewed. Let’s log transform our variable with the scale_x_log10() function.

ggplot(data = gapminder_df,
       mapping = aes(x = gdpPercap)) +
  geom_histogram() +
  scale_x_log10()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the message says, we can adjust the binwidth argument, so let’s do that. Because the x axis is on a log10 scale, the bin width is measured in log10 units.

ggplot(data = gapminder_df,
       mapping = aes(x = gdpPercap)) +
  geom_histogram(binwidth = 0.05) +
  scale_x_log10()

Of course, if one prefers a boxplot, that is possible as well. We will check how life expectancy varies between and within continents. We’ll use geom_boxplot(). In this approach, we create an object for our plot. You don’t need to do this (although there are instances where it is useful), but it shows you that just about everything can be an object in R.

p_box <- ggplot(data = gapminder_df,
                mapping = aes(x = continent,
                              y = lifeExp)) +
  geom_boxplot()

p_box

The box plot is read as follows. The box contains the middle 50% of the values: its lower and upper edges are the first and third quartiles, respectively, and the line inside the box is the median. The whiskers extend to the smallest and largest values that are not classed as outliers (in ggplot2, values within 1.5 times the interquartile range of the box edges); the outliers themselves are drawn as individual points.

In visual form:

Source: Wickham, Hadley, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc., 2016.
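
To connect the picture to numbers, here is a sketch that computes the pieces of one box by hand for the European life expectancy values. It assumes the common 1.5 times interquartile range rule for the whiskers; all object names are made up for this illustration.

```r
library(gapminder)

europe <- gapminder$lifeExp[gapminder$continent == "Europe"]

# the box: first quartile, median, third quartile
q <- quantile(europe, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# values beyond these fences count as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# the whiskers end at the most extreme values inside the fences
whisker_low <- min(europe[europe >= lower_fence])
whisker_high <- max(europe[europe <= upper_fence])
outliers <- europe[europe < lower_fence | europe > upper_fence]

q
c(whisker_low, whisker_high)
length(outliers)
```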

To use a bar plot instead, we simply switch to geom_bar().

ggplot(data = gapminder_df,
                mapping = aes(x = continent,
                              y = lifeExp)) +
  geom_bar()
#> Error: stat_count() must not be used with a y aesthetic.

Oops. By default, geom_bar() counts the number of observations in each category, which does not work if we also specify a y variable. We can solve this in two ways. First, let’s tell ggplot not to do any additional calculations by specifying stat = "identity".

# using stat = "identity"
ggplot(data = gapminder_df,
       mapping = aes(x = continent, y = lifeExp)) +
  geom_bar(stat = "identity")

Second, we can use the geom_col geom, which already assumes the “identity” argument.

ggplot(data = gapminder_df,
                mapping = aes(x = continent,
                              y = lifeExp)) +
  geom_col()
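
One caveat: gapminder_df has many rows per continent, so both versions stack the individual values and the height of each bar is the sum of all life expectancy values in that continent. If the aim is to compare average life expectancy, one option (a sketch, with made-up object names) is to summarise the data first and plot the result.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# one row per continent with its mean life expectancy
mean_life <- gapminder %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp))

# geom_col() now draws exactly one value per bar
p_mean <- ggplot(data = mean_life,
                 mapping = aes(x = continent, y = mean_lifeExp)) +
  geom_col()

p_mean
```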

Next, let’s use the gapminder dataset we have loaded to investigate the relationship between the life expectancy and GDP per capita variables. We’ll use the geom_point() function, which we add to the p1 object with a +.

In p1 we specify the data source and the variables we want to plot. To this object we can attach the desired representation style with the geom_ functions.

p1 <- ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp))


p1 + geom_point()

Let’s refine this plot slightly: add labels, title, caption, and also transform the GDP variable. (plus some other minor cosmetics)

Check the comments in the code snippet to see what each line does!

p1 + geom_point(alpha = 0.25) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
    scale_x_log10() + # rescale our x axis
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder")

So far so good. With some minor additions the plot looks all right. But what if we want to see how each continent fares in this relationship? We need a new version of the p1 object that includes an additional argument inside aes(): color = variable. With the points colored by continent, it becomes clear that European country-years cluster in the high-GDP, high life expectancy upper right corner.

p1_grouped <- ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) # this is where we specify that we want to color the data by continents.

p1_grouped + geom_point(alpha = 0.75) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
    scale_x_log10() + # rescale our x axis
    labs(x = "GDP per capita (log $)", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder dataset")

When we are done with our nice figure, we can save it as well. I’d suggest always saving with code, and never from the “Plots” pane on the right. By default, ggsave() saves the last plot that was displayed.

ggsave("gapminder_scatter.png", dpi = 600) # the higher the dpi, the sharper the saved image

We can see how life expectancy changed in Mexico, Afghanistan, Sudan and Slovenia by using the geom_line() geom. For this, we create a new dataset by subsetting the gapminder one. The %in% operator does the same thing as == but for multiple values. For subsetting we use the dplyr::filter() function. Don’t worry if this sounds like too much; we will spend a whole session on how to subset and clean our data.

#subset the dataset to have our selected countries.
comp_df <- gapminder_df %>% 
    filter(country %in% c("Mexico", "Afghanistan", "Sudan", "Slovenia"))
#> Warning: package 'bindrcpp' was built under R version 3.4.4

# create the ggplot object with the data and mapping info
ggplot(data = comp_df,
                  mapping = aes(x = year,
                                y = lifeExp,
                                color = country)) +
  geom_line(aes(group = country)) # we need to tell ggplot that we want to group our lines by countries
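
The difference between == and %in% used in the filter above can be seen on a small made-up vector:

```r
countries <- c("Mexico", "Sudan", "Norway", "Slovenia")  # made-up vector

# == compares every element against a single value
is_mexico <- countries == "Mexico"
is_mexico

# %in% asks, for each element, whether it appears anywhere in the set
in_selection <- countries %in% c("Mexico", "Afghanistan", "Sudan", "Slovenia")
in_selection
```

Using == with a vector of several values on the right-hand side does not do this: R compares the two vectors element by element, which is rarely what you want when filtering.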

ggplot2 makes it easy to create individual subplots for each category by “faceting” our data. Let’s plot the growth in life expectancy over time on each continent. We use the geom_line() function to draw a line and we tell ggplot to facet by adding the facet_wrap(~ variable) function.

ggplot(data = gapminder_df,
                  mapping = aes(x = year,
                                y = lifeExp)) +
  geom_line(aes(group = country)) + # we need to tell ggplot that we want to group our lines by countries
  facet_wrap(~ continent) # create a small graph for each continent

2.3.2 The base R toolkit (OPTIONAL)

To get some idea about our variables, let’s plot them on a histogram. First, we examine the GDP per capita variable from our gapminder dataset. For this, we use the hist() function. It gives a bare-bones histogram (the frequency distribution) of our chosen continuous variable.

# histogram indicates that our data is not normally distributed.
hist(gapminder_df$gdpPercap)


# maybe transforming it would help
hist(log(gapminder_df$gdpPercap))

If the plot is shared with collaborators, some formatting could help. We can change the ‘breaks’ argument to get a more granular view of our data. To discourage scientific notation on the axes, let’s use options(scipen=5).

hist(gapminder_df$gdpPercap, breaks = 30)

options(scipen=5)

# you can also control the breakpoints beyond giving it a simple value.
hist(gapminder_df$gdpPercap, breaks = seq(0, 120000, 1500))

For a categorical variable, we should use a bar plot. To see how many observations we have for each continent, we can use the table() function to count the observations in each category and then plot that object with barplot().

cont_table <- table(gapminder_df$continent)

barplot(cont_table, main = "Number of observations by continents")

For continuous data, we can use the generic plot() function.

# to check how life expectancy changed in Mexico, we can subset our dataset and then plot the result. More on subsetting tomorrow!

mexico <- subset(gapminder_df, country == "Mexico")

# the first argument of the plot is the x axis, second is the y axis. 
plot(mexico$year, mexico$lifeExp)

Something is not quite right…

# check how we can change the dots to a line.
# ?plot

# We should add `type`. Also we can structure our function to be a bit more readable.

plot(mexico$year, mexico$lifeExp, 
     xlab = "Year", 
     ylab = "Life expectancy (Years)",
     col = "orange",
     type = "l")

plot(log(gapminder_df$gdpPercap),
     gapminder_df$lifeExp,
     xlab = "Log GDP per capita",
     ylab = "Life expectancy",
     main = "Relationship between GDP and life expectancy",
     type = "p")

To see how each variable is associated with the others, we can use the pairs() function. It creates a scatter plot matrix, where the variable names are on the diagonal. Within a row, the vertical axis of every panel is the variable named on that row’s diagonal; within a column, the horizontal axis is the variable named on that column’s diagonal.

pairs(gapminder_df[,c("lifeExp", "pop", "gdpPercap")])

We can make the plots a little nicer, by adjusting the points. (this can be done for the previous scatterplots as well!). pch = 19 selects a point type which is filled, then with col = rgb() we can specify the exact color that we are looking for. The alpha argument regulates the transparency of the points.

source: http://kktg.net/sgr/wp-content/uploads/2014/02/fig-15-3-pch-values.png

pairs(gapminder_df[,c("lifeExp", "pop", "gdpPercap")],
      pch=19, 
      col=rgb(0,0,0, alpha=0.1))

The formula for the boxplot instructs R to create a boxplot of the numerical variable lifeExp for each continent. In general terms, the formula is y ~ group.

boxplot(lifeExp ~ continent, data = gapminder_df,
        main = "Distribution of life expectancy by continent",
        xlab = "Continents",
        ylab = "Life expectancy")

We can also do a grouped bar plot. For this we will use the starwars dataset (it ships with dplyr) and plot eye color by gender for the characters of the Star Wars movies. The logic is the same as for the bar plot with a single variable, but now we pass two variables to the table() function. We also add a legend to our plot with the legend() function.

# grouped bar plot

# load the data
sw <- starwars


# create the table with the two variables of interest
sw_table <- table(sw$gender, sw$eye_color)

sw_table
#>                
#>                 black blue blue-gray brown dark gold green, yellow hazel
#>   female            2    6         0     5    0    0             0     2
#>   hermaphrodite     0    0         0     0    0    0             0     0
#>   male              7   13         1    16    1    1             1     1
#>   none              1    0         0     0    0    0             0     0
#>                
#>                 orange pink red red, blue unknown white yellow
#>   female             0    0   0         1       1     1      1
#>   hermaphrodite      1    0   0         0       0     0      0
#>   male               7    1   2         0       2     0      9
#>   none               0    0   1         0       0     0      0


# plot the grouped barplot. Mind the `beside` argument! We can also add a legend, so the colors are straightforward.
barplot(sw_table,
        beside = TRUE,
        col = c("orange", "brown", "green", "black"))
legend(x = "topright",
       legend = c("Female", "Hermaphrodite", "Male", "None"),
       fill = c("orange", "brown", "green", "black"),
       cex = 0.7)

If you want proportions instead of counts, you can use the prop.table() function. If we use the sw_table table as the only argument, then we will get proportions of the total. If we use 1 as the second argument, we get proportions of rows and if we use 2, we get proportions of columns. Remember, rows come first and columns second!
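
A minimal sketch of the three variants, using a small two-way table built from the built-in mtcars data (so it runs without dplyr); the object names are made up for this illustration.

```r
# rows are the number of cylinders, columns are the number of gears
cyl_gear <- table(mtcars$cyl, mtcars$gear)

prop_total <- prop.table(cyl_gear)     # every cell divided by the grand total
prop_rows  <- prop.table(cyl_gear, 1)  # each row sums to 1
prop_cols  <- prop.table(cyl_gear, 2)  # each column sums to 1

round(prop_rows, 2)
```

The same calls work on sw_table; a proportion table can also be passed to barplot() in place of the count table.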